Simple Linear Regression I

EC 320, Set 03

Andrew Dickinson

Spring 2024

Admin

PS01:

  • Due Tuesday at 11:59p
  • Covers remaining topics from the review and from lecture Wednesday

Koans K01, K02, and K03:

  • Due Friday at 5:00p

Reading: (up to this point)

ItE: R, 1

So far we’ve identified the fundamental problem econometricians face. How do we proceed? Regressions!

  • Running models
  • Confounders
  • Omitted Variable Bias

Regression logic

Regression models

Modeling is about reducing something really complicated to something simple that captures part of that complicated reality.

  • Try to tell stories that are easy to understand, and easy to learn from
  • Model toy versions of reality

Economists often rely on linear regression for statistical comparisons.

  • Describes the relationship between a dependent (endogenous) variable and one or more independent (exogenous) variable(s)
  • “Linear” is more flexible than you think

Regression models

Regression analysis helps us make all-else-equal comparisons


Running a regression provides correlative (and sometimes even causal) information about the relationship between two variables


Ex. By how much does \(Y\) change when \(X\) increases by one unit?

Regression models

Modeling forces us to be explicit about the potential sources of bias

  • Model the effect of \(X\) on \(Y\) while controlling for potential confounders that may muddy the water
  • Failure to account for sources of bias leads to biased estimates.

Ex. Not controlling for confounding variables leads to omitted-variable bias, a close cousin of selection bias

  • Why? Omitted variables that correlate with our covariate of interest hide within the error term, distorting our results.

Returns to education

Research Question: By how much does an additional year of schooling increase future earnings?

  • Dependent variable: Earnings
  • Independent variable: An additional year of school

Q. How might education increase earnings?

Q. Why might a simple comparison between high- and low-educated individuals not isolate the economic returns to education?

More education (X) increases lifetime earnings (Y)



More education (X) increases lifetime earnings (Y) along with a lot of other things (U).



More education (X) increases lifetime earnings (Y) along with a lot of other things (U). But a lot of other things (U) also impact education (X).

Any unobserved variable that opens a backdoor path between education (X) and earnings (Y) is called a confounder

Returns to education

How might we estimate the causal effect of an additional year of schooling on earnings?

Approach 1: Compare average earnings of individuals who have 12 years of education to those with 16

  • Prone to bias from a variety of confounding variables

Approach 2: Estimate a regression that compares the earnings of individuals with the same profiles

  • Try to control for confounders by including them in the model
  • Race, parental income, geography etc.

But before taking on confounders and using regression to identify causal relationships… let’s break down the anatomy of the simple regression model

Simple regression model

The regression model

We can estimate the effect of \(X\) on \(Y\) by estimating a regression model:

\[Y_i = \beta_0 + \beta_1 X_i + u_i\]

  • \(Y_i\) is the dependent variable
  • \(X_i\) is the independent variable (continuous)
  • \(\beta_0\) is the intercept parameter. \(E\left[ {Y_i | X_i=0} \right] = \beta_0\)
  • \(\beta_1\) is the slope parameter, which under the correct causal setting represents the marginal effect of \(X_i\) on \(Y_i\): \(\frac{\partial Y_i}{\partial X_i} = \beta_1\)
  • \(u_i\) is an error term including all other (omitted) factors affecting \(Y_i\).

The error term

\(u_i\) is quite special


Consider the data generating process of variable \(Y_i\),

  • \(u_i\) captures all unobserved variables that explain variation in \(Y_i\).


Some error will exist in all models; our aim is to minimize error under a set of constraints

  • Error is the price we are willing to accept for a simplified model

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

  • Our description (model) of the relationship between \(Y\) and \(X\) is a simplification
  • Other variables have been left out (omitted)

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

2. Aggregation of Variables

  • Microeconomic relationships are often summarized
  • Ex. Housing prices (\(X\)) are described by county-level median home value data

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

2. Aggregation of Variables

3. Model misspecification

  • Model structure is incorrectly specified
  • Ex. \(Y\) depends on the anticipated value of \(X\) in the previous period, not \(X\)

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

2. Aggregation of Variables

3. Model misspecification

4. Functional misspecification

  • The functional relationship is specified incorrectly
  • True relationship is nonlinear, not linear

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

2. Aggregation of Variables

3. Model misspecification

4. Functional misspecification

5. Measurement error

  • Measurement of the variables in the data is just wrong
  • \(Y\) or \(X\)

The error term

Five items contribute to the existence of the disturbance term:

1. Omission of independent variables

2. Aggregation of Variables

3. Model misspecification

4. Functional misspecification

5. Measurement error

Running regressions

Using an estimator with data on \(X_i\) and \(Y_i\), we can estimate a fitted regression line:

\[ \hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i \]

  • \(\hat{Y_i}\) is the fitted value of \(Y_i\).
  • \(\hat{\beta}_0\) is the estimated intercept.
  • \(\hat{\beta}_1\) is the estimated slope.

This procedure produces misses, known as residuals, \(Y_i - \hat{Y_i}\)
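As a quick sketch of these mechanics (in Python rather than the course's R, and with made-up numbers), computing fitted values and residuals from a pair of candidate estimates looks like:

```python
# Fitted values and residuals for candidate estimates (hypothetical data).
X = [1, 2, 3, 4]
Y = [4, 3, 5, 8]
b0_hat, b1_hat = 1.5, 1.4  # candidate intercept and slope

Y_hat = [b0_hat + b1_hat * x for x in X]               # fitted values
residuals = [y - y_hat for y, y_hat in zip(Y, Y_hat)]  # misses: Y_i - Y_hat_i
print(residuals)
```

Note that these particular residuals sum to zero, a property of least-squares fits we will meet again.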

I think it would be easier to think about regression with a concrete example.

Ex. Effect of police on crime

Ex. Effect of police on crime

Empirical question:

Does the number of on-campus police officers affect campus crime rates? If so, by how much?


Always plot your data first

Ex. Effect of police on crime

The scatter plot suggests that a weak positive relationship exists

  • A sample correlation of 0.14 confirms this


But correlation does not imply causation


Let’s estimate a statistical model

Ex. Effect of police on crime

We express the relationship between a dependent variable and an independent variable as linear:

\[ {\color{#81A1C1} \text{Crime}_i} = \beta_1 + \beta_2 {\color{#B48EAD} \text{Police}_i} + u_i. \]

  • \(\beta_1\) is the intercept or constant.

  • \(\beta_2\) is the slope coefficient.

  • \(u_i\) is an error term or disturbance term.

Ex. Effect of police on crime

The intercept tells us the expected value of \(\text{Crime}_i\) when \(\text{Police}_i = 0\).

\[ \text{Crime}_i = {\color{#BF616A} \beta_1} + \beta_2\text{Police}_i + u_i \]

Usually not the focus of an analysis.

Ex. Effect of police on crime

The slope coefficient tells us the expected change in \(\text{Crime}_i\) when \(\text{Police}_i\) increases by one.

\[ \text{Crime}_i = \beta_1 + {\color{#BF616A} \beta_2} \text{Police}_i + u_i \]

“A one-unit increase in \(\text{Police}_i\) is associated with a \(\color{#BF616A}{\beta_2}\)-unit increase in \(\text{Crime}_i\).”

Interpretation of this parameter is crucial

Under certain (strong) assumptions, \(\color{#BF616A}{\beta_2}\) is the effect of \(X_i\) on \(Y_i\).

  • Otherwise, it’s the association of \(X_i\) with \(Y_i\).

Ex. Effect of police on crime

The error term reminds us that \(\text{Police}_i\) does not perfectly explain \(\text{Crime}_i\).

\[ \text{Crime}_i = \beta_1 + \beta_2\text{Police}_i + {\color{#BF616A} u_i} \]

Represents all other factors that explain \(\text{Crime}_i\).

  • Useful mnemonic: pretend that \(u\) stands for “unobserved” or “unexplained.”

Ex. Effect of police on crime

How might we apply the simple linear regression model to our question about the effect of on-campus police on campus crime?

\[ \text{Crime}_i = \beta_1 + \beta_2\text{Police}_i + u_i. \]

  • \(\beta_1\) is the crime rate for colleges without police.
  • \(\beta_2\) is the increase in the crime rate for an additional police officer per 1000 students.

Ex. Effect of police on crime

How might we apply the simple linear regression model to our question?

\[ \text{Crime}_i = \beta_1 + \beta_2\text{Police}_i + u_i \]

\(\beta_1\) and \(\beta_2\) are the unobserved population parameters we want to estimate


We estimate

  • \(\hat{\beta_1}\) and \(\hat{\beta_2}\) generate predictions of \(\text{Crime}_i\) called \(\widehat{\text{Crime}_i}\).

  • We call the predictions of the dependent variable fitted values.

  • Together, these trace a line: \(\widehat{\text{Crime}_i} = \hat{\beta_1} + \hat{\beta_2}\text{Police}_i\).

So, the question becomes: how do we pick \(\hat{\beta_1}\) and \(\hat{\beta_2}\)?

Let’s take some guesses: \(\hat{\beta_1} = 60\)


Let’s take some guesses: \(\hat{\beta_1} = 60\) and \(\hat{\beta_2}=-7\)

Does this line represent the data well?

Let’s take some guesses: \(\hat{\beta_1} = 30\) and \(\hat{\beta_2}=0\)

What about this one?

Let’s take some guesses: \(\hat{\beta_1} = 15.6\) and \(\hat{\beta_2}=7.94\)

Or this one?

Residuals

Using \(\hat{\beta_1}\) and \(\hat{\beta_2}\) to make \(\hat{Y_i}\) generates misses.

We call these misses residuals:

\[ {\color{#BF616A} \hat{u}_i} = {\color{#BF616A}Y_i - \hat{Y_i}}. \]

AKA \({\color{#BF616A}e_i}\).

\(\hat{\beta_1} = 15.6\) and \(\hat{\beta_2}=7.94\)

Does this line represent the data well?

Residuals

What if we picked an estimator that minimizes the residuals?

Why not minimize

\[ \sum_{i=1}^n \hat{u}_i^2 \]

so that the estimator makes fewer big misses?

Minimizing this quantity, the residual sum of squares (RSS), is convenient: squared numbers are never negative, so positive and negative misses cannot cancel each other out

RSS gives bigger penalties to bigger residuals.


Minimizing RSS

We could test thousands of guesses of \(\beta_1\) and \(\beta_2\) and pick the pair that has the smallest RSS

What if we did that? Let’s write a loop that guesses \(\beta_1\) and \(\beta_2\) 100,000 times and collects the RSS for each guess.

Then we plot it in three dimensions: \(\beta_1\) (x), \(\beta_2\) (y), RSS (z)

Simulated RSS across \(\beta_1, \beta_2 \sim \text{uniform}(-100, 100)\)

Simulated RSS across \(\beta_1, \beta_2 \sim \text{uniform}(-100, 100)\) (zoomed in)

Minimizing RSS

We could test thousands of guesses of \(\beta_1\) and \(\beta_2\) and pick the pair that has the smallest RSS

What if we did that? Let’s write a loop that guesses \(\beta_1\) and \(\beta_2\) 100,000 times and collects the RSS for each guess.

Then we plot it in three dimensions: \(\beta_1\) (x), \(\beta_2\) (y), RSS (z)
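The guessing loop can be sketched as follows (a Python stand-in for the course's R code, run on a small hypothetical sample):

```python
import random

# Hypothetical sample; in the lecture the data are campus crime/police rates.
X = [1, 2, 3, 4]
Y = [4, 3, 5, 8]

def rss(b1, b2):
    """Residual sum of squares for the candidate line Y_hat = b1 + b2 * X."""
    return sum((y - b1 - b2 * x) ** 2 for x, y in zip(X, Y))

random.seed(320)
guesses = [(random.uniform(-100, 100), random.uniform(-100, 100))
           for _ in range(100_000)]
best_b1, best_b2 = min(guesses, key=lambda g: rss(*g))
print(best_b1, best_b2, rss(best_b1, best_b2))
```

With 100,000 draws the best guess lands close to the true minimizer, but never exactly on it, which is why the closed-form math below is the better tool.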




Or… We could just do a little math

Ordinary least squares

OLS

The OLS estimator chooses the parameters \(\hat{\beta_1}\) and \(\hat{\beta_2}\) that minimize the residual sum of squares (RSS):

\[ \min_{\hat{\beta}_1,\, \hat{\beta}_2} \quad \color{#BF616A}{\sum_{i=1}^n \hat{u}_i^2} \]

This is why we call the estimator ordinary least squares.

OLS Formulas

For details, see the handout posted on Canvas.

Slope coefficient

\[ \hat{\beta}_2 = \dfrac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

Intercept

\[ \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X} \]

Slope coefficient

The slope estimator is equal to the sample covariance divided by the sample variance of \(X\):

\[ \begin{aligned} \hat{\beta}_2 &= \dfrac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \\ \\ &= \dfrac{ \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{ \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2} \\ \\ &= \dfrac{S_{XY}}{S^2_X}. \end{aligned} \]

Ex. Effect of police on crime

Using the OLS formulas, we get \(\hat{\beta_1}\) = 18.41 and \(\hat{\beta_2}\) = 1.76.

A quick note on outliers

Outliers

Suppose we added another data point.

Outliers

Fitted line without outlier.

Outliers

Fitted line without outlier. Fitted line with outlier.

Outliers

Because we square residuals, outliers hold greater influence on OLS estimates.


OLS has a quirk: the farther a data point lies from the rest, the more weight its squared residual carries in determining the estimates.


Think of each observation as a lever applying force to the fitted line. An outlier is a long lever: just as in physics, a longer lever applies more force, so an outlier pulls harder on the estimates.
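The lever effect can be seen numerically. In this Python sketch (made-up data), adding a single extreme point flips the sign of the estimated slope:

```python
def ols(X, Y):
    # closed-form simple OLS: slope = cov(X, Y) / var(X)
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    b2 = (sum((y - y_bar) * (x - x_bar) for x, y in zip(X, Y))
          / sum((x - x_bar) ** 2 for x in X))
    return y_bar - b2 * x_bar, b2

X, Y = [1, 2, 3, 4], [4, 3, 5, 8]
_, slope = ols(X, Y)                   # slope is positive (1.4)
_, slope_out = ols(X + [20], Y + [0])  # one extreme point added
print(slope, slope_out)                # the outlier flips the slope's sign
```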

Coefficient interpretation

Interpretation

There are two stages of interpretation of a regression equation

1. Interpreting the regression estimates in words

2. Deciding whether this interpretation should be taken at face value


Both stages are important, but for now, we will focus on the former


Recall the OLS estimates from the previous example

Ex. Effect of police on crime

Using the OLS formulas, we get \(\hat{\beta_1}\) = 18.41 and \(\hat{\beta_2}\) = 1.76.

Coefficient Interpretation

How do we interpret \(\hat{\beta_1}\) = 18.41 and \(\hat{\beta_2}\) = 1.76?

Recall, the general interpretation of the intercept is the estimated value of \(Y_i\) when \(X_i=0\)

And, the general interpretation of the slope parameter is the estimated change in \(Y_i\) for a marginal increase in \(X_i\)


First, it is important to understand the units:

  • \(\widehat{\text{Crime}_i}\) is measured as a crime rate, the number of crimes per 1,000 students on campus
  • \(\text{Police}_i\) is also measured as a rate, the number of police officers per 1,000 students on campus

Coefficient Interpretation

Using OLS gives us the fitted line

\[ \widehat{\text{Crime}_i} = \hat{\beta}_1 + \hat{\beta}_2\text{Police}_i. \]

What does \(\hat{\beta_1}\) = 18.41 tell us? Without any police on campus, the predicted crime rate is 18.41 crimes per 1,000 students

What does \(\hat{\beta_2}\) = 1.76 tell us? For each additional police officer per 1,000, there is an associated increase in the crime rate by 1.76 crimes per 1,000 people on campus.


Does this mean that police cause crime? Probably not. Why?

OLS Ex.

OLS Application

Suppose we do not yet have an empirical question, but wish to observe the mechanics involved in generating parameter estimates.

Consider the following mini sample \(\{X,Y\}\) data points:

  i   x   y
  1   1   4
  2   2   3
  3   3   5
  4   4   8

Regression Model: \(Y_i = \beta_1 + \beta_2 X_i + u_i\)
Fitted Line: \(\hat{Y_i}= b_1 + b_2 X_i\)

Let’s calculate the estimated parameters \(b_1\) and \(b_2\) using the OLS estimator

OLS Application

Recall that OLS focuses on minimizing the RSS. We will take four steps.

  1. Calculate the residuals, \(\hat{u}_i = Y_i - \hat{Y_i}\)
  2. Sum the squared residuals, \(RSS = \sum_{i=1}^n \hat{u}_i^2\)
  3. Differentiate the RSS with respect to each parameter, \(\frac{\partial RSS}{\partial b_j}\), so that we have one first-order condition for each unknown parameter
  4. Solve for the unknown parameters

We’ll use the mini sample to get an idea of the mechanics involved. Given larger datasets and more covariates, R comes to the rescue.

Warning: Check the second derivatives to ensure a minimum rather than a maximum; here the second-order conditions are satisfied, so the first-order conditions pin down the global minimum.

OLS Application

Step 1: Calculate the residuals

\[ \begin{aligned} \hat{u}_1 &= Y_1 - \hat{Y_1} = Y_1 - b_1 -b_2 X_1 \\ \hat{u}_2 &= Y_2 - \hat{Y_2} = Y_2 - b_1 -b_2 X_2\\ \hat{u}_3 &= Y_3 - \hat{Y_3} = Y_3 - b_1 -b_2 X_3\\ \hat{u}_4 &= Y_4 - \hat{Y_4} = Y_4 - b_1 -b_2 X_4 \end{aligned} \]

Plug in values from our given data for \(\{X,Y\}\) \[ \begin{aligned} \hat{u}_1 &= 4 - b_1 -1*b_2 \\ \hat{u}_2 &= 3 - b_1 -2*b_2\\ \hat{u}_3 &= 5 - b_1 -3*b_2\\ \hat{u}_4 &= 8 - b_1 -4*b_2 \end{aligned} \]

Next we’ll square each of these terms and sum them to get the RSS

OLS Application

Step 2: Calculate the RSS

\[ \begin{aligned} RSS &= \sum_{i=1}^{n} \hat{u_i}^2 = \hat{u_1}^2 + \hat{u_2}^2 + \hat{u_3}^2 + \hat{u_4}^2\\ &= (4 - b_1 - b_2)^2 + (3 - b_1 -2b_2)^2 + (5 - b_1 -3b_2)^2 + (8 - b_1 -4b_2)^2\\ & = 114 + 4b_1^2 + 30b_2^2 - 40b_1 - 114b_2 + 20b_1 b_2 \end{aligned} \]
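The expansion above is easy to get wrong by hand, so here is a quick numerical sanity check in Python: the expanded polynomial and the direct sum of squared residuals agree at several trial points.

```python
data = [(1, 4), (2, 3), (3, 5), (4, 8)]  # the mini sample

def rss_direct(b1, b2):
    # sum of squared residuals, computed term by term
    return sum((y - b1 - b2 * x) ** 2 for x, y in data)

def rss_expanded(b1, b2):
    # the expanded quadratic from the derivation above
    return 114 + 4*b1**2 + 30*b2**2 - 40*b1 - 114*b2 + 20*b1*b2

for b1, b2 in [(0.0, 0.0), (1.0, 1.0), (1.5, 1.4), (-2.0, 3.0)]:
    assert abs(rss_direct(b1, b2) - rss_expanded(b1, b2)) < 1e-9
print("expansion checks out")
```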


Recall that OLS minimizes the RSS expression with respect to the specific parameters involved.

To find the values that minimize a particular expression, we need to apply differentiation.

OLS Application

Step 3: Differentiate RSS by parameters

To differentiate with respect to a particular variable, multiply each term by its power and subtract \(1\) from that power.

e.g. for \(y=2x^3, \partial y / \partial x = 2*3x^{3-1} = 6x^2\)


\[ \begin{align} \frac{\partial RSS}{\partial b_1} = 0 &\implies (4*2)b_1^{2-1} -(40*1)b_1^{1-1} + (20*1) b_1^{1-1} b_2 = 0 \notag \\ &\implies 8b_1 - 40 + 20 b_2 = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ Eq(1) \end{align} \]


\[ \begin{align} \frac{\partial RSS}{\partial b_2} = 0 &\implies (30*2)b_2^{2-1} -(114*1)b_2^{1-1} + (20*1) b_1 b_2^{1-1} = 0 \notag\\ &\implies 60b_2 - 114 + 20 b_1 = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ Eq(2) \end{align} \]

OLS Application

Step 4: Solve for parameters

With two unknowns \(\{b_1, b_2\}\) and two first-order conditions \(\left\{\frac{\partial RSS}{\partial b_1} = 0, \frac{\partial RSS}{\partial b_2} = 0\right\}\), we can solve for our parameters.

How? Substitute one expression into the other.

\[ \begin{aligned} 20 b_2 &= 40 - 8b_1 \implies 60b_2 = 120 - 24b_1\\ &\text{substitute into second equation}\\ Eq(2): \ & 120 - 24b_1 -114 + 20b_1 = 0\\ & 6 = 4b_1 \implies b_1 = 1.5\\ Eq(1): \ & 20b_2 = 40 - 8\times 1.5 = 28 \implies b_2 = 1.4 \end{aligned} \]
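The substitution above can be double-checked by solving the same two first-order conditions as a linear system (a Python sketch using Cramer's rule instead of substitution):

```python
# First-order conditions, rearranged:
#   Eq(1): 8*b1 + 20*b2 = 40
#   Eq(2): 20*b1 + 60*b2 = 114
det = 8 * 60 - 20 * 20            # determinant of the 2x2 system
b1 = (40 * 60 - 20 * 114) / det   # Cramer's rule for b1
b2 = (8 * 114 - 20 * 40) / det    # Cramer's rule for b2
print(b1, b2)  # → 1.5 1.4
```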

OLS Application

OLS would prescribe \(\{1.5, 1.4\}\) for our set of parameter estimates.


Fitting a line through the data points, with the aim of minimizing the RSS, results in the same implied parameters

  • In practice, such parameters are estimated computationally

  • We will perform an exercise by hand in PS03 to understand the mechanics underlying the values we hang our hats on